Speedup process_uc_wildcards
#193
Conversation
…tional, e.g. ctslvl and ctype from ~FI_COMM
benchmarks always compare to HEAD of main branch
# Conflicts:
#	.gitignore
#	pyproject.toml
#	utils/run_benchmarks.py
#	xl2times/__main__.py
#	xl2times/datatypes.py
#	xl2times/excel.py
#	xl2times/transforms.py
#	xl2times/utils.py
obscure _match_uc_wildcards bugfix
btw, if you convert it to draft PR the CI will run on every push, as well.
xl2times/utils.py (Outdated)

```python
def remove_positive_patterns(pattern):
    return ",".join([word[1:] for word in pattern.split(",") if word[0] == "-"])

...

@functools.cache
```
Caches old results instead of recalculating the wildcard patterns that are often repeated across tables. Seems to give a small (a couple of percent) speedup.
Nice! Just wondering how large the cache gets for a big model like Ireland/Austimes? Wondering if we'd need to use `@functools.lru_cache(maxsize=...)` at some point instead. Perhaps something to keep in mind if memory usage is ever a problem.
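A minimal sketch of the trade-off being discussed: `@functools.cache` grows without bound, while `@functools.lru_cache(maxsize=...)` caps memory by evicting least-recently-used entries. The function body mirrors the `remove_positive_patterns` helper from the diff; the `maxsize` value here is an arbitrary illustration.

```python
import functools


@functools.lru_cache(maxsize=1024)  # bounded; @functools.cache would be unbounded
def remove_positive_patterns(pattern: str) -> str:
    # Keep only negated words ("-foo"), stripping the leading "-".
    return ",".join(word[1:] for word in pattern.split(",") if word[0] == "-")


print(remove_positive_patterns("a,-b,c,-d"))  # -> b,d
remove_positive_patterns("a,-b,c,-d")         # second call is a cache hit
print(remove_positive_patterns.cache_info())  # hits=1, misses=1, currsize=1
```

`cache_info()` makes it easy to check hit rates and current size when deciding whether a bound is needed.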
organised imports, appeased linters
process_uc_wildcards
Looks great, thanks again!
```python
# Call the conversion function directly
summary = run(parse_args(args))
```
Hmm, thinking about it again, perhaps there is one reason to use subprocess -- at least in the CI, should we check that `xl2times` works as expected from the command line? But on the other hand, `run(parse_args(` is pretty much the same as the CLI invocation, and CI is probably faster without subprocess... I'm undecided, so would love your thoughts. And we can leave it as is in this PR and discuss in an issue, maybe?
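For reference, a minimal sketch of the subprocess alternative being weighed here, demonstrated on the stdlib's `json.tool` rather than `xl2times` itself (the helper name and arguments are illustrative):

```python
import subprocess
import sys


def run_cli(module: str, args: list[str]) -> str:
    # Launch the tool exactly as a user would, in a fresh interpreter.
    # Slower than calling run(parse_args(args)) in-process, but it also
    # exercises the command-line entry point end to end.
    result = subprocess.run(
        [sys.executable, "-m", module, *args],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


# Demonstrated on a stdlib CLI; "xl2times" would slot in the same way.
print(run_cli("json.tool", ["--help"])[:60])
```

The in-process call skips interpreter start-up on every benchmark, which is why CI is likely faster without subprocess.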
```python
with ProcessPoolExecutor(max_workers=max_workers) as executor:
    results = list(executor.map(run_a_benchmark, benchmarks))

if debug:
    # bypass process pool and call benchmarks directly if --debug is set.
    ...
```
Would love to have this documented in the CLI help for `--debug`, thanks!
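A sketch of how that help text and the debug bypass might fit together, with a toy `run_a_benchmark` stand-in (names here are illustrative, not the actual `run_benchmarks.py` code):

```python
import argparse
from concurrent.futures import ProcessPoolExecutor


def run_a_benchmark(n: int) -> int:
    # Toy stand-in for the real benchmark runner.
    return n * n


def run_all(benchmarks: list[int], debug: bool, max_workers: int = 4) -> list[int]:
    if debug:
        # Serial, in-process: breakpoints and stack traces work normally.
        return [run_a_benchmark(b) for b in benchmarks]
    with ProcessPoolExecutor(max_workers=max_workers) as executor:
        return list(executor.map(run_a_benchmark, benchmarks))


parser = argparse.ArgumentParser()
parser.add_argument(
    "--debug",
    action="store_true",
    help="run benchmarks serially in this process (bypassing the process "
    "pool) so they can be stepped through in a debugger",
)

print(run_all([1, 2, 3], debug=True))  # -> [1, 4, 9]
```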
```python
matches = [
    df.iloc[:, 0].to_list() if df is not None and len(df) != 0 else None
    for df in matches
]
```
Just curious about this code: is `matches` a list of DataFrames at the end of the previous statement? Perhaps the FIXME can be resolved if we change the `unique_filters.apply(lambda row: ...` line to something like:
`matches = [matcher(row, dictionary) for _index, row in unique_filters.iterrows()]`
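To illustrate the suggestion with a toy `matcher` and toy frames (the real ones live in `transforms.py`): the comprehension yields a plain Python list of DataFrame-or-None, which the downstream list comprehension can then consume directly without any chance of pandas coercing the result.

```python
import pandas as pd


def matcher(row: pd.Series, lookup: pd.DataFrame):
    # Toy stand-in for the real matcher: matching rows, or None if none.
    hits = lookup[lookup["name"] == row["pattern"]]
    return hits if not hits.empty else None


unique_filters = pd.DataFrame({"pattern": ["a", "zzz"]})
dictionary = pd.DataFrame({"name": ["a", "b"], "value": [1, 2]})

# Unlike DataFrame.apply, this always stays a plain list of results:
matches = [matcher(row, dictionary) for _index, row in unique_filters.iterrows()]

matches = [
    df.iloc[:, 0].to_list() if df is not None and len(df) != 0 else None
    for df in matches
]
print(matches)  # -> [['a'], None]
```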
reused process/commodities maps
updated --debug doc
# Conflicts:
#	xl2times/transforms.py
…eature/wildcard_speedup
Now I remember why I used subprocess to call `xl2times` from `run_benchmarks.py`: in the CI, when it switches to the main branch, it needs to run the main branch's version of the tool. But if we call the tool as a python function, I think some of the PR version of the tool remains in memory, and we don't get what we want.

This looks to be the reason CI is failing on this PR #183: https://github.com/etsap-TIMES/xl2times/actions/runs/8020431584/job/21910275023 (The error is that it can't find a file in `xl2times/config/...` that was added by the PR when it is running tests in the main branch.)

This PR undoes the change from #193 that removed the use of subprocess. See also https://github.com/etsap-TIMES/xl2times/pull/193/files?diff=unified&w=0#r1498683705
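The module-cache behaviour behind this failure can be sketched in isolation (`fake_xl2times` is a made-up name standing in for the tool's package):

```python
import sys
import types

# Simulate "the PR branch's tool is already imported in this process".
pr_module = types.ModuleType("fake_xl2times")
pr_module.BRANCH = "pr-branch"
sys.modules["fake_xl2times"] = pr_module

# Even after the CI checks out main, an in-process import is served from
# sys.modules, so the PR branch's already-loaded code keeps running:
import fake_xl2times
print(fake_xl2times.BRANCH)  # -> pr-branch

# A fresh subprocess starts with an empty module cache and therefore loads
# whatever branch is currently checked out on disk.
```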
Speeds up `uc_wildcard` processing by about 40x (8s -> 0.18s on the Ireland benchmark) by omitting repeated regex matches, vectorising with `merge`, and omitting some string processing. The difference isn't obvious in the benchmark times because wildcards are a relatively small portion of the total runtime for smaller models, but it should have a significant impact on full model runs.
This is ready for a review now :)